Abstract

We introduce a new distance-based phylogeny reconstruction technique which provably achieves, at 
sufficiently short branch lengths, a polylogarithmic sequence-length requirement — improving significantly 
over previous polynomial bounds for distance-based methods. The technique is based on an averaging 
procedure that implicitly reconstructs ancestral sequences. 

By the same token, we extend previous results on phase transitions in phylogeny reconstruction to general
time-reversible models. More precisely, we show that in the so-called Kesten-Stigum zone (roughly, a region 
of the parameter space where ancestral sequences are well approximated by "linear combinations" of the 
observed sequences) sequences of length poly(log n) suffice for reconstruction when branch lengths are 
discretized. Here n is the number of extant species. 

Our results challenge, to some extent, the conventional wisdom that estimates of evolutionary distances 
alone carry significantly less information about phylogenies than full sequence datasets. 
*The current manuscript is the full version with proofs of [Roc08]. In subsequent work [Roc09] the results stated here were improved
to logarithmic sequence length, thereby matching the best results for general methods.
1 Introduction



The evolutionary history of a group of organisms is generally represented by a phylogenetic tree or phylogeny
[Fel04, SS03]. The leaves of the tree represent the current species. Each branching indicates a speciation
event. Many of the most popular techniques for reconstructing phylogenies from molecular data, e.g. UPGMA,
Neighbor-Joining, and BIO-NJ [SS63, SN87, Gas97], are examples of what are known as distance-matrix methods.
The main advantage of these methods is their speed, which stems from a straightforward approach: 1) the
estimation of a distance matrix from observed molecular sequences; and 2) the repeated agglomeration of the
closest clusters of species. Each entry of the distance matrix is an estimate of the evolutionary distance between
the corresponding pair of species, that is, roughly the time elapsed since their most recent common ancestor.
This estimate is typically obtained by comparing aligned homologous DNA sequences extracted from the extant
species, the basic insight being that the closer the species, the more similar their sequences. Most distance methods
run in time polynomial in n, the number of leaves, and in k, the sequence length. This performance compares
very favorably to that of the other two main classes of reconstruction methods, likelihood and parsimony
methods, which are known to be computationally intractable [GF82, DS86, Day87, MV05, CT06, Roc06].

The question we address in this paper is the following: Is there a price to pay for this speed and simplicity? 
There are strong combinatorial [SHP88] and statistical [Fel04] reasons to believe that distance methods are 
not as accurate as more elaborate reconstruction techniques, notably maximum likelihood estimation (MLE). 
Indeed, in a typical instance of the phylogenetic reconstruction problem, we are given aligned DNA sequences
$\{(\xi_l^i)_{i=1}^k\}_{l \in L}$, one sequence for each leaf $l \in L$, from which we seek to infer the phylogeny on $L$. Generally, all
sites $(\xi_l^i)_{l \in L}$, for $i = 1, \ldots, k$, are assumed to be independent and identically distributed according to a Markov
model on a tree (see Section 1.1). For a subset $W \subseteq L$, we denote by $\mu_W$ the distribution of $(\xi_l^i)_{l \in W}$ under
this model. Through their use of the distance matrix, distance methods reduce the data to pairwise sequence
correlations, that is, they only use estimates of $\mu_2 = \{\mu_W : W \subseteq L, |W| = 2\}$. In doing so, they seemingly
fail to take into account more subtle patterns in the data involving three or more species at a time. In contrast,
MLE for example outputs a model that maximizes the joint probability of all observed sequences. We call 
methods that explicitly use the full dataset, such as MLE, holistic methods. 

It is important to note that the issue is not one of consistency: when the sequence length tends to infinity, 
the estimate provided by distance methods — just like MLE — typically converges to the correct phylogeny. In 
particular, under mild assumptions, it suffices to know the pairwise site distributions $\mu_2$ to recover the topology
of the phylogeny [CH91, Cha96]. Rather the question is: how fast is this convergence? Or more precisely, how 
should k scale as a function of n to guarantee a correct reconstruction with high probability? And are distance 
methods significantly slower to converge than holistic methods? Although we do not give a complete answer 
to these questions of practical interest here, we do provide strong evidence that some of the suspicions against 
distance methods are based on a simplistic view of the distance matrix. In particular, we open up the surprising 
possibility that distance methods actually exhibit optimal convergence rates. 

Context. It is well-known that some of the most popular distance-matrix methods actually suffer from a 
prohibitive sequence-length requirement [Att99, LC06]. Nevertheless, over the past decade, much progress 
has been made in the design of fast-converging distance-matrix techniques, starting with the seminal work of 
Erdős et al. [ESSW99a]. The key insight behind the algorithm in [ESSW99a], often dubbed the Short Quartet
Method (SQM), is that it discards long evolutionary distances, which are known to be statistically unreliable. 
The algorithm works by first building subtrees of small diameter and, in a second stage, putting the pieces back 
together. The SQM algorithm runs in polynomial time and guarantees the correct reconstruction with high 
probability of any phylogeny (modulo reasonable assumptions) when k = poly(n). This is currently the best 
known convergence rate for distance methods. (See also [DMR06, DHJ+06, Mos07, GMS08, DMR09b] for 
faster-converging algorithms involving partial reconstruction of the phylogeny.) 

Although little is known about the sequence-length requirement of MLE [SS99, SS02], recent results of
Mossel [Mos04], Daskalakis et al. [DMR06, DMR09a], and Mihaescu et al. [MHR09] on a conjecture of
Steel [Ste01] indicate that convergence rates as low as k = O(log n) can be achieved when the branch lengths
are sufficiently short and discretized, using insights from statistical physics. We briefly describe these results.

As mentioned above, the classical model of DNA sequence evolution is a Markov model on a tree that 
is closely related to stochastic models used to study particle systems [Lig85, Geo88]. This type of model 
undergoes a phase transition that has been extensively studied in probability theory and statistical physics: at 
short branch lengths (in the binary symmetric case, up to 15% divergence per edge), in what is called the 
reconstruction phase, good estimates of the ancestral sequences can be obtained from the observed sequences; 
on the other hand, outside the reconstruction phase, very little information about ancestral states diffuses to 
the leaves. See e.g. [EKPS00] and references therein. The new algorithms in [Mos04, DMR06, DMR09a,
MHR09] exploit this phenomenon by alternately 1) reconstructing a few levels of the tree using distance-matrix
techniques and 2) estimating distances between internal nodes by reconstructing ancestral sequences at
the newly uncovered nodes. The overall algorithm is not distance-based, however, as the ancestral sequence
reconstruction is performed using a complex function of the observed sequences named recursive majority.
The rate k = O(log n) achieved by these algorithms is known to be necessary in general. Moreover, the slower
rate k = poly(n) is in fact necessary for all methods, distance-based or holistic, outside the reconstruction
phase [Mos03]. In particular, note that distance methods are in some sense "optimal" outside the reconstruction
phase by the results of [ESSW99a].

Beyond the oracle view of the distance matrix. It is an outstanding open problem to determine whether distance
methods can achieve k = O(log n) in the reconstruction phase.¹ From previous work on fast-converging
distance methods, it is tempting to conjecture that k = poly(n) is the best one can hope for. Indeed, all previous
algorithms use the following "oracle view" of the distance matrix, as formalized by King et al. [KZZ03] and
Mossel [Mos07]. As mentioned above, the reliability of distance estimates depends on the true evolutionary
distances. From standard concentration inequalities, it follows that if leaves $a$ and $b$ are at distance $\tau(a, b)$, then
the usual distance estimate $\hat\tau(a, b)$ (see Section 1.1) satisfies:

if $\tau(a, b) < D + \varepsilon$ or $\hat\tau(a, b) < D + \varepsilon$, then $|\tau(a, b) - \hat\tau(a, b)| < \varepsilon$, (1)

for $\varepsilon, D$ such that $k \propto (1 - e^{-\varepsilon})^{-2} e^{2D}$. Fix $\varepsilon > 0$ small and $k \ll \mathrm{poly}(n)$. Let $T$ be a complete binary tree with
$\log_2 n$ levels. Imagine that the distance matrix is given by the following oracle: on input a pair of leaves $(a, b)$
the oracle returns an estimate $\hat\tau(a, b)$ which satisfies (1). Now, notice that for any tree $T'$ which is identical to
$T$ on the first $(\log_2 n)/2$ levels above the leaves, the oracle is allowed to return the same distance estimate as for
$T$. That is, we cannot distinguish $T$ and $T'$ in this model unless k = poly(n). (This argument can be made
more formal along the lines of [KZZ03].)
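
To make the scaling in (1) concrete, the following minimal Python sketch evaluates the right-hand side of $k \propto (1 - e^{-\varepsilon})^{-2} e^{2D}$. The function name `relative_sequence_length` and the constant `c` are illustrative assumptions and do not come from the paper; since (1) only fixes k up to a constant factor, only ratios between the returned values are meaningful.

```python
import math

def relative_sequence_length(eps: float, diameter: float, c: float = 1.0) -> float:
    """Evaluate c * (1 - e^{-eps})^{-2} * e^{2 D}.

    The constant c is a placeholder (assumption): relation (1) determines the
    sequence length only up to a constant factor.
    """
    if eps <= 0:
        raise ValueError("eps must be positive")
    return c * (1.0 - math.exp(-eps)) ** -2 * math.exp(2.0 * diameter)

if __name__ == "__main__":
    # Each extra unit of reliable diameter D multiplies the requirement by e^2.
    for d in (1.0, 2.0, 4.0, 8.0):
        print(d, relative_sequence_length(0.1, d))
```

Read the other way, a sequence length that is polylogarithmic in n only keeps distances of order log log n reliable, which is the regime considered in Lemma 2.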

What the oracle model ignores is that, under the assumption that the sequences are generated by a Markov
model of evolution, the distance estimates $(\hat\tau(a, b))_{a,b \in [n]}$ are in fact correlated random variables. More concretely,
for leaves $a, b, c, d$, note that the joint distribution of $(\hat\tau(a, b), \hat\tau(c, d))$ depends in a nontrivial way
on the joint site distribution $\mu_W$ at $W = \{a, b, c, d\}$. In other words, even though the distance matrix is
seemingly only an estimate of the pairwise correlations $\mu_2$, it actually contains some information about all
joint distributions. Note however that it is not immediately clear how to exploit this extra information or even
how useful it could be.

As it turns out, the correlation structure of the distance matrix is in fact very informative at short branch
lengths. More precisely, we introduce in this paper a new distance-based method with a convergence rate of
k = poly(log n) in the reconstruction phase (to be more accurate, in the so-called Kesten-Stigum phase; see
below), improving significantly over previous poly(n) results. Note that the oracle model allows only the
reconstruction of a o(1) fraction of the levels in that case. Our new algorithm involves a distance averaging
procedure that implicitly reconstructs ancestral sequences, thereby taking advantage of the phase transition
discussed above. We also obtain the first results on Steel's conjecture beyond the simple symmetric models
studied by Daskalakis et al. [DMR06, DMR09a, MHR09] (the so-called CFN and Jukes-Cantor models). In
the next subsections, we introduce general definitions and state our results more formally. We also give an
overview of the proof.

1 Mike Steel offers a 100-dollar reward for the solution of this problem.

Further related work. For further related work on efficient phylogenetic tree reconstruction, see [ESSW99b, 
HNW99, CK01, Csu02]. 

1.1 Definitions 

Phylogenies. We define phylogenies and evolutionary distances more formally. 

Definition 1 (Phylogeny) A phylogeny is a rooted, edge-weighted, leaf-labeled tree $\mathcal{T} = (V, E, [n], \rho; \tau)$
where: $V$ is the set of vertices; $E$ is the set of edges; $L = [n] = \{0, \ldots, n - 1\}$ is the set of leaves; $\rho$ is
the root; $\tau : E \to (0, +\infty)$ is a positive edge weight function. We further assume that all internal nodes in $\mathcal{T}$
have degree 3 except for the root $\rho$ which has degree 2. We let $\mathbb{Y}_n$ be the set of all such phylogenies on $n$ leaves
and we denote $\mathbb{Y} = \{\mathbb{Y}_n\}_{n \geq 1}$.

Definition 2 (Tree Metric) For two leaves $a, b \in [n]$, we denote by $\mathrm{Path}(a, b)$ the set of edges on the unique
path between $a$ and $b$. A tree metric on a set $[n]$ is a positive function $d : [n] \times [n] \to (0, +\infty)$ such that
there exists a tree $T = (V, E)$ with leaf set $[n]$ and an edge weight function $w : E \to (0, +\infty)$ satisfying the
following: for all leaves $a, b \in [n]$

$$d(a, b) = \sum_{e \in \mathrm{Path}(a, b)} w_e.$$

For convenience, we denote by $(\tau(a, b))_{a,b \in [n]}$ the tree metric corresponding to the phylogeny $\mathcal{T} = (V, E, [n], \rho; \tau)$.
We extend $\tau(u, v)$ to all vertices $u, v \in V$ in the obvious way.

Example 1 (Homogeneous Tree) For an integer $h \geq 0$, we denote by $\mathcal{T}^{(h)} = (V^{(h)}, E^{(h)}, L^{(h)}, \rho^{(h)}; \tau)$ a
rooted phylogeny where $T^{(h)}$ is the $h$-level complete binary tree with arbitrary edge weight function $\tau$ and
$L^{(h)} = [2^h]$. For $0 \leq h' \leq h$, we let $L^{(h)}_{h'}$ be the vertices on level $h - h'$ (from the root). In particular,
$L^{(h)}_0 = L^{(h)}$ and $L^{(h)}_h = \{\rho^{(h)}\}$. We let $\mathbb{HY} = \{\mathbb{HY}_n\}_{n \geq 1}$ be the set of all phylogenies with homogeneous
underlying trees.

Model of molecular sequence evolution. Phylogenies are reconstructed from molecular sequences extracted 
from the observed species. The standard model of evolution for such sequences is a Markov model on a tree 
(MMT). 

Definition 3 (Markov Model on a Tree) Let $\Phi$ be a finite set of character states with $\varphi = |\Phi|$. Typically
$\Phi = \{+1, -1\}$ or $\Phi = \{A, G, C, T\}$. Let $n \geq 1$ and let $T = (V, E, [n], \rho)$ be a rooted tree with leaves
labeled in $[n]$. For each edge $e \in E$, we are given a $\varphi \times \varphi$ stochastic matrix $M^e = (M^e_{ij})_{i,j \in \Phi}$, with fixed
stationary distribution $\pi = (\pi_i)_{i \in \Phi}$. An MMT $(\{M^e\}_{e \in E}, T)$ associates a state $\sigma_v$ in $\Phi$ to each vertex $v$ in $V$
as follows: pick a state for the root $\rho$ according to $\pi$; moving away from the root, choose a state for each vertex
$v$ independently according to the distribution $(M^e_{\sigma_u, j})_{j \in \Phi}$ with $e = (u, v)$ where $u$ is the parent of $v$.
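
The sampling procedure in Definition 3 can be sketched as follows. This is a minimal illustration, not the paper's code: the function names `sample_mmt` and `cfn_matrix`, the dictionary-based tree encoding, and the two-leaf example are all assumptions made for the sketch.

```python
import random

def sample_mmt(children, root, transition, pi, rng):
    """Sample one site from a Markov model on a rooted tree.

    children:   dict vertex -> list of child vertices (leaves map to [])
    transition: dict (parent, child) -> stochastic matrix as dict of dicts,
                transition[(u, v)][i][j] = P(state of v is j | state of u is i)
    pi:         dict state -> stationary probability, used at the root
    """
    def draw(dist):
        states = list(dist)
        return rng.choices(states, weights=[dist[s] for s in states])[0]

    state = {root: draw(pi)}
    stack = [root]
    while stack:  # move away from the root, one edge at a time
        u = stack.pop()
        for v in children[u]:
            state[v] = draw(transition[(u, v)][state[u]])
            stack.append(v)
    return state

def cfn_matrix(p):
    """Symmetric two-state matrix that flips the state with probability p."""
    return {+1: {+1: 1 - p, -1: p}, -1: {+1: p, -1: 1 - p}}

if __name__ == "__main__":
    children = {"rho": [0, 1], 0: [], 1: []}
    transition = {("rho", 0): cfn_matrix(0.1), ("rho", 1): cfn_matrix(0.2)}
    print(sample_mmt(children, "rho", transition, {+1: 0.5, -1: 0.5}, random.Random(0)))
```

Calling `sample_mmt` k times with the same arguments produces the k i.i.d. sites assumed throughout; only the entries at the leaves would be observed in practice.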






The most common MMT used in phylogenetics is the so-called general time-reversible (GTR) model. 

Definition 4 (GTR Model) Let $\Phi$ be a set of character states with $\varphi = |\Phi|$ and $\pi$ be a distribution on $\Phi$
satisfying $\pi_i > 0$ for all $i \in \Phi$. For $n \geq 1$, let $\mathcal{T} = (V, E, [n], \rho; \tau)$ be a phylogeny. Let $Q$ be a $\varphi \times \varphi$ rate
matrix, that is, $Q_{ij} > 0$ for all $i \neq j$ and

$$\sum_{j \in \Phi} Q_{ij} = 0,$$

for all $i \in \Phi$. Assume $Q$ is reversible with respect to $\pi$, that is,

$$\pi_i Q_{ij} = \pi_j Q_{ji},$$

for all $i, j \in \Phi$. The GTR model on $\mathcal{T}$ with rate matrix $Q$ is an MMT on $T = (V, E, [n], \rho)$ with transition
matrices $M^e = e^{\tau_e Q}$, for all $e \in E$. By the reversibility assumption, $Q$ has $\varphi$ real eigenvalues

$$0 = \Lambda_1 > \Lambda_2 \geq \cdots \geq \Lambda_\varphi.$$

We normalize $Q$ by fixing $\Lambda_2 = -1$. We denote by $\mathbb{Q}_\varphi$ the set of all such rate matrices. We let $\mathbb{G}_{n,\varphi} = \mathbb{Y}_n \otimes \mathbb{Q}_\varphi$
be the set of all $\varphi$-state GTR models on $n$ leaves. We denote $\mathbb{G}_\varphi = \{\mathbb{G}_{n,\varphi}\}_{n \geq 1}$. We denote by $\xi_W$ the vector of
states on the vertices $W \subseteq V$. In particular, $\xi_{[n]}$ are the states at the leaves. We denote by $\mathcal{L}_{\mathcal{T},Q}$ the distribution
of $\xi_{[n]}$.

GTR models include as special cases many popular models such as the CFN model. 
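Example 2 (CFN Model) The CFN model is the GTR model with $\varphi = 2$, $\pi = (1/2, 1/2)$, and

$$Q = Q^{\mathrm{CFN}} \equiv \begin{pmatrix} -1/2 & 1/2 \\ 1/2 & -1/2 \end{pmatrix}.$$

Example 3 (Binary Asymmetric Channel) More generally, letting $\Phi = \{+, -\}$ and $\pi = (\pi_+, \pi_-)$, with
$\pi_+, \pi_- > 0$, we can take

$$Q = \begin{pmatrix} -\pi_- & \pi_- \\ \pi_+ & -\pi_+ \end{pmatrix}.$$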
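
For the two-state models of Examples 2 and 3, the transition matrices $M^e = e^{\tau_e Q}$ have a simple closed form, which the sketch below evaluates. It relies on the observation that, for two states, $Q = \Pi - I$ where every row of $\Pi$ equals $\pi$, so that $e^{\tau Q} = \Pi + e^{-\tau}(I - \Pi)$. The function names are illustrative assumptions, and the closed form is specific to two states rather than to general GTR models.

```python
import math

def rate_matrix(pi_plus):
    """Rate matrix of the binary asymmetric channel, states ordered (+, -)."""
    pi_minus = 1.0 - pi_plus
    return [[-pi_minus, pi_minus], [pi_plus, -pi_plus]]

def transition_matrix(pi_plus, tau):
    """M = exp(tau * Q) in closed form (two-state case only).

    Q = Pi - I with Pi the matrix whose rows all equal pi, and Pi is a
    projection, hence exp(tau * Q) = Pi + e^{-tau} (I - Pi).
    """
    pi = [pi_plus, 1.0 - pi_plus]
    decay = math.exp(-tau)
    return [[pi[j] + decay * ((i == j) - pi[j]) for j in range(2)]
            for i in range(2)]

if __name__ == "__main__":
    # CFN case: the off-diagonal entry is (1 - e^{-tau}) / 2.
    print(transition_matrix(0.5, 0.3))
```

With $\pi_+ = 1/2$ this recovers the CFN model, where the probability of a state change along an edge of length $\tau$ is $(1 - e^{-\tau})/2$.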



Phylogenetic reconstruction. A standard assumption in molecular evolution is that each site in a sequence
(DNA, protein, etc.) evolves independently according to a Markov model on a tree, such as the GTR model
above. Because of the reversibility assumption, the root of the phylogeny cannot be identified and we reconstruct
phylogenies up to their root.

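Definition 5 (Phylogenetic Reconstruction Problem) Let $\mathbb{Y} = \{\mathbb{Y}_n\}_{n \geq 1}$ be a subset of phylogenies and $\mathbb{Q}_\varphi$
be a subset of rate matrices on $\varphi$ states. Let $\mathcal{T} = (V, E, [n], \rho; \tau) \in \mathbb{Y}$. If $T = (V, E, [n], \rho)$ is the rooted
tree underlying $\mathcal{T}$, we denote by $T_-[\mathcal{T}]$ the tree $T$ where the root is removed: that is, we replace the two
edges adjacent to the root by a single edge. We denote by $\mathbb{T}_n$ the set of all leaf-labeled trees on $n$ leaves
with internal degrees 3 and we let $\mathbb{T} = \{\mathbb{T}_n\}_{n \geq 1}$. A phylogenetic reconstruction algorithm is a collection of
maps $\mathcal{A} = \{\mathcal{A}_{n,k}\}_{n,k \geq 1}$ from sequences $(\xi^i_{[n]})_{i=1}^k \in (\Phi^{[n]})^k$ to leaf-labeled trees $T \in \mathbb{T}_n$. We only consider
algorithms $\mathcal{A}$ computable in time polynomial in $n$ and $k$. Let $k(n)$ be an increasing function of $n$. We say that
$\mathcal{A}$ solves the phylogenetic reconstruction problem on $\mathbb{Y} \otimes \mathbb{Q}_\varphi$ with sequence length $k = k(n)$ if for all $\delta > 0$,
there is $n_0 \geq 1$ such that for all $n \geq n_0$, $\mathcal{T} \in \mathbb{Y}_n$, $Q \in \mathbb{Q}_\varphi$,

$$\mathbb{P}\left[\mathcal{A}_{n,k(n)}\left((\xi^i_{[n]})_{i=1}^{k(n)}\right) = T_-[\mathcal{T}]\right] \geq 1 - \delta,$$

where $(\xi^i_{[n]})_{i=1}^{k(n)}$ are i.i.d. samples from $\mathcal{L}_{\mathcal{T},Q}$.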

An important result of this kind was given by Erdős et al. [ESSW99a].

Theorem 1 (Polynomial Reconstruction [ESSW99a]) Let $0 < f < g < +\infty$ and denote by $\mathbb{Y}^{f,g}$ the set
of all phylogenies $\mathcal{T} = (V, E, [n], \rho; \tau)$ satisfying $f < \tau_e < g$, $\forall e \in E$. Then, for all $\varphi \geq 2$ and all
$0 < f < g < +\infty$, the phylogenetic reconstruction problem on $\mathbb{Y}^{f,g} \otimes \mathbb{Q}_\varphi$ can be solved with k = poly(n).

This result was recently improved by Daskalakis et al. [DMR06, DMR09a] (see also [MHR09]) in the so-called
Kesten-Stigum reconstruction phase, that is, when $g < \ln\sqrt{2}$.

Definition 6 ($\Delta$-Branch Model) Let $0 < \Delta < f < g < +\infty$ and denote by $\mathbb{Y}^{f,g}_\Delta$ the set of all phylogenies
$\mathcal{T} = (V, E, [n], \rho; \tau)$ satisfying $f < \tau_e < g$ where $\tau_e$ is an integer multiple of $\Delta$, for all $e \in E$. For $\varphi \geq 2$ and
$Q \in \mathbb{Q}_\varphi$, we call $\mathbb{Y}^{f,g}_\Delta \otimes \{Q\}$ the $\Delta$-Branch Model ($\Delta$-BM).

Theorem 2 (Logarithmic Reconstruction [DMR06, DMR09a]. See also [MHR09].) Let $g^* = \ln\sqrt{2}$. Then,
for all $0 < \Delta < f < g < g^*$, the phylogenetic reconstruction problem on $\mathbb{Y}^{f,g}_\Delta \otimes \{Q^{\mathrm{CFN}}\}$ can be solved with
k = O(log n).²

Distance methods. The proof of Theorem 1 uses distance methods, which we now define formally. 

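Definition 7 (Correlation Matrix) Let $\Phi$ be a finite set with $\varphi \geq 2$. Let $(\xi_a^i)_{i=1}^k, (\xi_b^i)_{i=1}^k \in \Phi^k$ be the
sequences at $a, b \in [n]$. For $\upsilon_1, \upsilon_2 \in \Phi$, we define the correlation matrix between $a$ and $b$ by

$$F^{ab}_{\upsilon_1 \upsilon_2} = \frac{1}{k} \sum_{i=1}^k \mathbb{1}\{\xi_a^i = \upsilon_1, \xi_b^i = \upsilon_2\},$$

and $F^{ab} = (F^{ab}_{\upsilon_1 \upsilon_2})_{\upsilon_1, \upsilon_2 \in \Phi}$.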
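
As a small illustration of Definition 7, the sketch below tabulates the empirical joint frequencies of the states observed at two leaves. The function name `correlation_matrix` and the nested-dictionary representation are assumptions chosen for readability.

```python
from collections import Counter

def correlation_matrix(seq_a, seq_b, states):
    """Empirical joint frequencies F[v1][v2] of (state at a, state at b)."""
    if len(seq_a) != len(seq_b) or len(seq_a) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    k = len(seq_a)
    counts = Counter(zip(seq_a, seq_b))  # site-by-site pairs of states
    return {v1: {v2: counts[(v1, v2)] / k for v2 in states} for v1 in states}

if __name__ == "__main__":
    F = correlation_matrix("AACG", "AAGG", "ACGT")
    print(F["A"]["A"], F["C"]["G"], F["G"]["G"])
```

A distance-based method in the sense of Definition 8 is allowed to see only these matrices, one for each pair of leaves, and not the aligned sequences themselves.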

Definition 8 (Distance Method) A phylogenetic reconstruction algorithm $\mathcal{A} = \{\mathcal{A}_{n,k}\}_{n,k \geq 1}$ is said to be
distance-based if $\mathcal{A}$ depends on the data $(\xi^i_{[n]})_{i=1}^k \in (\Phi^{[n]})^k$ only through the correlation matrices $\{F^{ab}\}_{a,b \in [n]}$.

The previous definition takes a very general view of distance-based methods: any method that uses only pairwise
sequence comparisons. In practice, most distance-based approaches actually use a specific distance estimator,
that is, a function of $F^{ab}$ that converges to $\tau(a, b)$ in probability as the sequence length $k \to +\infty$. We give two classical
examples below.

Example 4 (CFN Metric) In the CFN case with state space $\Phi = \{+, -\}$, a standard distance estimator (up
to a constant) is

$$\mathcal{D}(F) = -\ln(1 - 2(F_{+-} + F_{-+})).$$
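
A minimal sketch of the CFN metric of Example 4, computed directly from two aligned $\pm 1$ sequences. The function name `cfn_distance` and the choice of returning infinity when at least half of the sites disagree are assumptions of this sketch; the paper does not specify how to handle that degenerate case.

```python
import math

def cfn_distance(seq_a, seq_b):
    """-ln(1 - 2 (F_{+-} + F_{-+})) for sequences over {+1, -1}."""
    if len(seq_a) != len(seq_b) or len(seq_a) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    mismatch = sum(1 for x, y in zip(seq_a, seq_b) if x != y) / len(seq_a)
    if mismatch >= 0.5:
        return math.inf  # assumption: the estimate is treated as "too far"
    return -math.log(1.0 - 2.0 * mismatch)

if __name__ == "__main__":
    a = [+1, +1, -1, -1, +1, -1, +1, +1]
    b = [+1, -1, -1, -1, +1, -1, +1, -1]
    print(cfn_distance(a, b))
```

Note that $1 - 2(F_{+-} + F_{-+})$ equals the empirical average of the products of the two sequences, which is the form that the averaging procedure of Section 1.3 exploits.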

Example 5 (Log-Det Distance [BH87, Lak94, LSHP94, Ste94]) More generally, a common distance estimator
(up to scaling) is the so-called log-det distance

$$\mathcal{D}(F) = -\ln|\det F|.$$

Loosely speaking, the log-det distance can be thought of as a generalization of the CFN metric. We will use a
different generalization of the CFN metric. See Section 1.3.

2 The correct statement of this result appears in [DMR09a]. Because of different conventions, our edge weights are scaled by a factor
of 2 compared to those in [DMR09a]. The dependence of $k$ on $\Delta$ is $\Delta^{-2}$.

1.2 Results



In our main result, we prove that phylogenies under GTR models of mutation can be inferred using a
distance-based method with sequence length k = poly(log n).

Theorem 3 (Main Result) For all $\varphi \geq 2$, $0 < \Delta < f < g < g^*$ and $Q \in \mathbb{Q}_\varphi$, there is a distance-based
method solving the phylogenetic reconstruction problem on $\mathbb{Y}^{f,g}_\Delta \otimes \{Q\}$ with k = poly(log n).³

Note that this result is a substantial improvement over Theorem 1, at least in a certain range of parameters,
and that it almost matches the bound obtained in Theorem 2. The result is also novel in two ways over
Theorem 2: only the distance matrix is used; the result applies to a larger class of mutation matrices. A slightly
weaker version of the result stated here appeared without proof as [Roc08]. Note that in [Roc08] the result was 
stated without the discretization assumption which is in fact needed for the final step of the proof. This is further 
explained in Section 7.3 of [DMR09a]. In subsequent work [Roc09], the result stated here was improved to 
logarithmic sequence length, thereby matching Theorem 2. This new result follows a similar high-level proof 
but involves stronger concentration arguments [PR09], as well as a simplified algorithm. 

In an attempt to keep the paper as self-contained as possible, we first give a proof in the special case of
homogeneous trees. This allows us to keep the algorithmic details to a minimum. The proof appears in Section 3.
We extend the result to general trees in Appendix A. The more general result relies on a combinatorial algorithm 
of [DMR09a]. 



1.3 Proof Overview 

Distance averaging. The basic insight behind Steel's conjecture is that the accurate reconstruction of ancestral
sequences in the reconstruction phase can be harnessed to perform a better reconstruction of the phylogeny
itself. For now, consider the CFN model with character space $\{+1, -1\}$ and assume that our phylogeny is
homogeneous with uniform branch lengths $\omega$. Generate $k$ i.i.d. samples $(\sigma^i_V)_{i=1}^k$. Let $a, b$ be two internal vertices
on level $h - h' < h$ (from the root). Suppose we seek to estimate the distance between $a$ and $b$. This estimation
cannot be performed directly because the sequences at $a$ and $b$ are not known. However, we can try to estimate
these internal sequences. Denote by $A, B$ the leaf sets below $a$ and $b$ respectively. An estimate of the sequence
at $a$ is the (properly normalized) "site-wise average" of the sequences at $A$

$$\bar\sigma_a^i = \frac{1}{|A|} \sum_{a' \in A} \frac{\sigma^i_{a'}}{e^{-\tau(a, a')}}, \qquad (2)$$

for $i = 1, \ldots, k$, and similarly for $b$. It is not immediately clear how such a site-wise procedure involving
simultaneously a large number of leaves can be performed using the more aggregated information in the correlation
matrices $\{F^{uv}\}_{u,v \in [n]}$. Nevertheless, note that the quantity we are ultimately interested in computing is the
following estimate of the CFN metric between $a$ and $b$

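$$\hat\tau(a, b) = -\ln\left(\frac{1}{k} \sum_{i=1}^k \bar\sigma_a^i \bar\sigma_b^i\right).$$

3 As in Theorem 2, the dependence of $k$ on $\Delta$ is $\Delta^{-2}$.

Our results are based on the following observation:

$$\hat\tau(a, b) = -\ln\left(\frac{1}{k} \sum_{i=1}^k \left(\frac{1}{|A|} \sum_{a' \in A} \frac{\sigma^i_{a'}}{e^{-\tau(a, a')}}\right) \left(\frac{1}{|B|} \sum_{b' \in B} \frac{\sigma^i_{b'}}{e^{-\tau(b, b')}}\right)\right)$$

$$= -\ln\left(\frac{1}{|A||B|} \sum_{a' \in A} \sum_{b' \in B} e^{\tau(a, a') + \tau(b, b')} \left(\frac{1}{k} \sum_{i=1}^k \sigma^i_{a'} \sigma^i_{b'}\right)\right)$$

$$= -\ln\left(\frac{1}{|A||B|} \sum_{a' \in A} \sum_{b' \in B} e^{\tau(a, a') + \tau(b, b')} e^{-\hat\tau(a', b')}\right),$$

where we note that the last line depends only on distance estimates $\hat\tau(a', b')$ between leaves $a', b'$ in $A, B$. In other
words, through this procedure, which we call exponential averaging, we perform an implicit ancestral sequence
reconstruction using only distance estimates. One can also think of this as a variance reduction technique. When
the branch lengths are not uniform, one needs to use a weighted version of (2). This requires the estimation of
path lengths.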
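
The identity behind exponential averaging can be checked numerically. The sketch below, written for the CFN case, computes the estimate once from normalized site-wise averages as in (2) and once from pairwise leaf estimates only, and the two agree up to rounding. All function names, the path lengths `tau_a` and `tau_b` (assumed known here), and the toy data generator in the usage note are assumptions made for illustration.

```python
import math

def leaf_distance(x, y):
    """CFN estimate between two +/-1 sequences: -ln of their mean product."""
    return -math.log(sum(p * q for p, q in zip(x, y)) / len(x))

def estimate_sitewise(seqs_a, seqs_b, tau_a, tau_b):
    """-ln of the empirical correlation of the normalized site-wise averages.

    tau_a[j] is the (assumed known) path length from a to its j-th leaf.
    """
    k = len(seqs_a[0])
    total = 0.0
    for i in range(k):
        bar_a = sum(s[i] * math.exp(t) for s, t in zip(seqs_a, tau_a)) / len(seqs_a)
        bar_b = sum(s[i] * math.exp(t) for s, t in zip(seqs_b, tau_b)) / len(seqs_b)
        total += bar_a * bar_b
    return -math.log(total / k)

def estimate_from_distances(dist, tau_a, tau_b):
    """Same quantity, using only the pairwise leaf estimates dist[x][y]."""
    acc = 0.0
    for x, ta in enumerate(tau_a):
        for y, tb in enumerate(tau_b):
            acc += math.exp(ta + tb) * math.exp(-dist[x][y])
    return -math.log(acc / (len(tau_a) * len(tau_b)))
```

A possible usage, with sequences obtained by independently flipping 10% of the sites of a common random sequence (a toy stand-in for an actual tree model):

```python
import random

rng = random.Random(0)
k = 2000
base = [rng.choice((-1, 1)) for _ in range(k)]
noisy = lambda: [s if rng.random() > 0.1 else -s for s in base]
A, B = [noisy() for _ in range(3)], [noisy() for _ in range(2)]
tau_a, tau_b = [0.2, 0.3, 0.25], [0.1, 0.4]
dist = [[leaf_distance(x, y) for y in B] for x in A]
print(estimate_sitewise(A, B, tau_a, tau_b), estimate_from_distances(dist, tau_a, tau_b))
```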

GTR models. In the case of GTR models, the standard log-det estimator does not lend itself well to the
exponential averaging procedure described above. Instead, we use an estimator involving the right eigenvector
$\nu$ corresponding to the second eigenvalue $\Lambda_2$ of $Q$. For $a, b \in [n]$, we consider the estimator

$$\hat\tau(a, b) = -\ln\left(\nu^\top F^{ab} \nu\right). \qquad (3)$$

This choice is justified by a generalization of (2) introduced in [MP03]. Note that $\nu$ may need to be estimated.

Concentration. There is a further complication in that to obtain results with high probability, one needs to
show that $\hat\tau(a, b)$ is highly concentrated. However, one cannot directly apply standard concentration inequalities
because $\bar\sigma_a$ is not bounded. Classical results on the reconstruction problem imply that the variance of $\bar\sigma_a$
is finite, which is not quite enough. To amplify the accuracy, we "go down" log log n levels and compute
O(log n) distance estimates with conditionally independent biases. By performing a majority procedure, we
finally obtain a concentrated estimate.

1.4 Organization 

In Section 2, we provide a detailed account of the connection between ancestral sequence reconstruction and 
distance averaging. We then give a proof of our main result in the case of homogeneous trees in Section 3. 
Finally, in Appendix A, we give a sketch of the proof in the general case. 
All proofs are relegated to Appendix B. 

2 Ancestral Reconstruction and Distance Averaging 

Let $\varphi \geq 2$, $0 < \Delta < f < g < g^* = \ln\sqrt{2}$, and $Q \in \mathbb{Q}_\varphi$ with corresponding stationary distribution
$\pi > 0$. In this section we restrict ourselves to the homogeneous case $\mathcal{T} = \mathcal{T}^{(h)} = (V, E, [n], \rho; \tau)$ where we
take $h = \log_2 n$ and $f < \tau_e < g$ and $\tau_e$ is an integer multiple of $\Delta$, $\forall e \in E$. (See Examples 1 and 2 and
Theorem 2.)⁴

4 Note that, without loss of generality, we can consider performing ancestral state reconstruction on a homogeneous tree as it is 
always possible to "complete" a general tree with zero-length edges. We come back to this point in Appendix A. 










Throughout this section, we use a sequence length $k > \log^\kappa n$ where $\kappa > 1$ is a constant to be determined
later. We generate $k$ i.i.d. samples $(\xi^i_V)_{i=1}^k$ from the GTR model $(\mathcal{T}, Q)$ with state space $\Phi$.
All proofs are relegated to Appendix B. 

2.1 Distance Estimator 

The standard log-det estimator does not lend itself well to the averaging procedure discussed above. For
reconstruction purposes, we instead use an estimator involving the right eigenvector $\nu$ corresponding to the second
eigenvalue $\Lambda_2$ of $Q$. For $a, b \in [n]$, consider the estimator

$$\hat\tau(a, b) = -\ln\left(\nu^\top F^{ab} \nu\right), \qquad (4)$$

where the correlation matrix $F^{ab}$ was introduced in Definition 7. We first give a proof that this is indeed a
legitimate distance estimator. For more on connections between eigenvalues of the rate matrix and distance
estimation, see e.g. [GL96, GL98, GMY09].

Lemma 1 (Distance Estimator) Let $\hat\tau$ be as above. For all $a, b \in [n]$, we have

$$\mathbb{E}[e^{-\hat\tau(a, b)}] = e^{-\tau(a, b)}.$$

For $a \in [n]$ and $i = 1, \ldots, k$, let

$$\sigma_a^i = \nu_{\xi_a^i}.$$

Then (4) is equivalent to

$$\hat\tau(a, b) = -\ln\left(\frac{1}{k} \sum_{i=1}^k \sigma_a^i \sigma_b^i\right). \qquad (5)$$

Note that in the CFN case, we have simply $\nu = (1, -1)^\top$ and hence (5) can be interpreted as a generalization
of the CFN metric.
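
Lemma 1 can be verified exactly in the two-state case of Example 3. The sketch below computes $\mathbb{E}[\nu_{\xi_a} \nu_{\xi_b}]$ for two vertices at distance $\tau$ and compares it with $e^{-\tau}$. It assumes the normalization $\sum_i \pi_i \nu_i^2 = 1$ for the eigenvector (consistent with the CFN choice $\nu = (1, -1)^\top$) and uses the two-state closed form $e^{\tau Q} = \Pi + e^{-\tau}(I - \Pi)$; the function names are illustrative.

```python
import math

def second_eigenvector(pi_plus):
    """Right eigenvector nu of Q (Example 3) for the eigenvalue -1, normalized
    (assumption) so that sum_i pi_i * nu_i^2 = 1. States are ordered (+, -)."""
    pi_minus = 1.0 - pi_plus
    return [math.sqrt(pi_minus / pi_plus), -math.sqrt(pi_plus / pi_minus)]

def expected_correlation(pi_plus, tau):
    """E[nu_{xi_a} nu_{xi_b}] for two vertices at distance tau, computed from
    the stationary joint distribution pi_i * (e^{tau Q})_{ij}."""
    pi = [pi_plus, 1.0 - pi_plus]
    nu = second_eigenvector(pi_plus)
    decay = math.exp(-tau)
    total = 0.0
    for i in range(2):
        for j in range(2):
            m_ij = pi[j] + decay * ((i == j) - pi[j])  # two-state closed form
            total += pi[i] * m_ij * nu[i] * nu[j]
    return total

if __name__ == "__main__":
    print(expected_correlation(0.3, 0.7), math.exp(-0.7))
```

The two printed numbers coincide up to rounding, which is the expectation identity of Lemma 1 for a single site.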
Let

$$\underline{\pi} = \min_i \pi_i,$$

and

$$\bar\nu \equiv \max_i |\nu_i| \leq \frac{1}{\sqrt{\underline{\pi}}}.$$

The following lemmas show that the distance estimate above with sequence length poly(log n) is concentrated
for path lengths of order O(log log n).

Lemma 2 (Distorted Metric: Short Distances) Let $\delta > 0$, $\varepsilon > 0$, and $\gamma > 0$. There exists $\kappa > 1$, such that if
the following conditions hold for $u, v \in V$:

• [Small Diameter] $\tau(u, v) < \delta \log\log(n)$,

• [Sequence Length] $k > \log^\kappa(n)$,

then

$$|\tau(u, v) - \hat\tau(u, v)| < \varepsilon,$$

with probability at least $1 - n^{-\gamma}$.

Lemma 3 (Distorted Metric: Diameter Test) Let $D > 0$, $W > 5$, $\delta > 0$, and $\gamma > 0$. Let $u, v \in V$ not be
descendants of each other. Let $u_0, v_0 \in V$ be descendants of $u, v$ respectively. There exists $\kappa > 1$, such that if
the following conditions hold:

• [Large Diameter] $\tau(u_0, v_0) > D + \ln W$,

• [Close Descendants] $\tau(u_0, u), \tau(v_0, v) < \delta \log\log(n)$,

• [Sequence Length] $k > \log^\kappa(n)$,

then

$$\hat\tau(u, v) - \tau(u_0, u) - \tau(v_0, v) > D + \ln\frac{W}{\ldots},$$

with probability at least $1 - n^{-\gamma}$. On the other hand, if the first condition above is replaced by

• [Small Diameter] $\tau(u_0, v_0) < D + \ln(\ldots)$,

then

$$\hat\tau(u, v) - \tau(u_0, u) - \tau(v_0, v) < D + \ln\frac{W}{\ldots},$$

with probability at least $1 - n^{-\gamma}$.

2.2 Ancestral Sequence Reconstruction 

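Let $e = (x, y) \in E$ and assume that $x$ is closest to $\rho$ (in topological distance). We define $\mathrm{Path}(\rho, e) =
\mathrm{Path}(\rho, y)$, $|e|_\rho = |\mathrm{Path}(\rho, e)|$, and

$$R_\rho(e) = (1 - \theta_e^2)\, \Theta_{\rho,y}^{-2},$$

where $\Theta_{\rho,y} = e^{-\tau(\rho, y)}$ and $\theta_e = e^{-\tau(e)}$.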

Proposition 1 below is a variant of Lemma 5.3 in [MP03]. For completeness, we give a proof. 

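Proposition 1 (Weighted Majority: GTR Version) Let $\xi_{[n]}$ be a sample from $\mathcal{L}_{\mathcal{T},Q}$ (see Definition 4) with
corresponding $\sigma_{[n]}$. For a unit flow $\Psi$ from $\rho$ to $[n]$, consider the estimator

$$S = \sum_{x \in [n]} \frac{\Psi(x)\, \sigma_x}{\Theta_{\rho,x}}.$$

Then, we have

$$\mathbb{E}[S] = 0,$$

$$\mathbb{E}[S \mid \sigma_\rho] = \sigma_\rho,$$

and

$$\mathrm{Var}[S] = 1 + K_\Psi,$$

where

$$K_\Psi = \sum_{e \in E} R_\rho(e)\, \Psi(e)^2.$$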
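
The variance formula of Proposition 1 can be checked on a small example. The sketch below uses a hypothetical 4-leaf tree with arbitrary edge lengths, and relies on the second-moment identity $\mathbb{E}[\sigma_x \sigma_y] = e^{-\tau(x, y)}$ (the single-site form of Lemma 1) to compute $\mathrm{Var}[S]$ directly as a double sum over leaves. The tree, the edge lengths, and all names are assumptions made for illustration.

```python
import math

# Hypothetical 4-leaf tree: root "rho", internal nodes "u", "v", leaves 0..3.
PARENT = {"u": "rho", "v": "rho", 0: "u", 1: "u", 2: "v", 3: "v"}
# TAU[y] is the length of the edge joining y to its parent.
TAU = {"u": 0.20, "v": 0.30, 0: 0.10, 1: 0.25, 2: 0.15, 3: 0.05}
LEAVES = [0, 1, 2, 3]

def path_to_root(x):
    """Vertices from x up to (excluding) the root; each one stands for the
    edge above it."""
    path = []
    while x != "rho":
        path.append(x)
        x = PARENT[x]
    return path

def variance_direct(flow):
    """sum_{x,y} Psi(x) Psi(y) e^{-tau(x,y)} / (Theta_{rho,x} Theta_{rho,y})."""
    total = 0.0
    for x in LEAVES:
        for y in LEAVES:
            px, py = set(path_to_root(x)), set(path_to_root(y))
            d_xy = sum(TAU[e] for e in px ^ py)  # tau(x, y)
            d_x = sum(TAU[e] for e in px)        # tau(rho, x)
            d_y = sum(TAU[e] for e in py)        # tau(rho, y)
            total += flow[x] * flow[y] * math.exp(d_x + d_y - d_xy)
    return total

def variance_formula(flow):
    """1 + K_Psi with K_Psi = sum_e R_rho(e) Psi(e)^2."""
    k_psi = 0.0
    for y in PARENT:  # the edge e = (parent of y, y)
        psi_e = sum(flow[x] for x in LEAVES if y in path_to_root(x))
        theta_e_sq = math.exp(-2.0 * TAU[y])
        big_theta_sq = math.exp(-2.0 * sum(TAU[e] for e in path_to_root(y)))
        k_psi += (1.0 - theta_e_sq) / big_theta_sq * psi_e ** 2
    return 1.0 + k_psi

if __name__ == "__main__":
    uniform = {x: 0.25 for x in LEAVES}
    print(variance_direct(uniform), variance_formula(uniform))
```

The agreement holds for any unit flow on the leaves, not only the uniform one, which is what allows the weighted versions used later.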
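
Let $\Psi$ be a unit flow from $\rho$ to $[n]$. We will use the following multiplicative decomposition of $\Psi$: for $e = (x, y)$, if
$\Psi(x) > 0$, we let

$$\psi(y) = \frac{\Psi(y)}{\Psi(x)},$$

and, if instead $\Psi(x) = 0$, we let $\psi(y) = 0$; we also write $\psi(e) = \psi(y)$. Denoting $x{\uparrow}$ the immediate ancestor of $x \in V$ and letting
$\theta_x = e^{-\tau(x\uparrow, x)}$, it will be useful to re-write

$$K_\Psi = \sum_{h'=0}^{h-1} \sum_{x \in L^{(h)}_{h'}} (1 - \theta_x^2) \prod_{e \in \mathrm{Path}(\rho, x)} \frac{\psi(e)^2}{\theta_e^2}, \qquad (6)$$

and to define the following recursion from the leaves. For $x \in [n]$,

$$K_{x,\Psi} = 0.$$

Then, let $u \in V - [n]$ with children $u_1, u_2$ with corresponding edges $e_1, e_2$ and define

$$K_{u,\Psi} = \sum_{\alpha = 1, 2} \left((1 - \theta_{e_\alpha}^2) + K_{u_\alpha,\Psi}\right) \frac{\psi(e_\alpha)^2}{\theta_{e_\alpha}^2}.$$

Note that, from (6), we have $K_{\rho,\Psi} = K_\Psi$.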

Lemma 4 (Uniform Bound on $K_\Psi$) Let $\Psi$ be the homogeneous flow from $\rho$ to $[n]$. Then, we have

$$K_\Psi \leq \frac{1}{1 - e^{-2(g^* - g)}} < +\infty.$$
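
The role of the threshold $g^* = \ln\sqrt{2}$ in Lemma 4 can be seen on the simplest case. The sketch below computes $K_\Psi$ for the homogeneous flow on a complete binary tree whose edges all have the same length, and compares it with the bound of Lemma 4. The restriction to equal edge lengths and the function names are assumptions of this sketch.

```python
import math

G_STAR = math.log(math.sqrt(2.0))  # Kesten-Stigum threshold g* = ln sqrt(2)

def k_homogeneous(h, tau):
    """K_Psi for the h-level complete binary tree with all edge lengths tau
    and the homogeneous flow (an edge at depth l carries flow 2^{-l})."""
    theta_sq = math.exp(-2.0 * tau)
    total = 0.0
    for level in range(1, h + 1):
        edges = 2 ** level
        resistance = (1.0 - theta_sq) * theta_sq ** (-level)  # R_rho(e)
        total += edges * resistance * (2.0 ** -level) ** 2    # R_rho(e) Psi(e)^2
    return total

def lemma4_bound(g):
    """Right-hand side of Lemma 4; only meaningful for g < g*."""
    return 1.0 / (1.0 - math.exp(-2.0 * (G_STAR - g)))

if __name__ == "__main__":
    for h in (5, 20, 80):
        print(h, k_homogeneous(h, 0.30), lemma4_bound(0.30))
    print(k_homogeneous(80, 0.40))  # above g*: grows without bound in h
```

Below the threshold the level-by-level contributions form a convergent geometric series with ratio $e^{2\tau}/2 < 1$, so the variance of the ancestral estimator stays bounded regardless of the depth of the tree; above it the same series diverges.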



2.3 Distance Averaging 



The input to our tree reconstruction algorithm is the matrix of all estimated distances between pairs of leaves
$\{\hat\tau(a, b)\}_{a,b \in [n]}$. For short sequences, these estimated distances are known to be accurate for leaves that are
close enough. We now show how to compute distances between internal nodes in a way that involves only
$\{\hat\tau(a, b)\}_{a,b \in [n]}$ (and previously computed internal weights) using Proposition 1. However, the second-moment
guarantee in Proposition 1 is not enough to obtain good estimates with high probability. To remedy this situation,
we perform a large number of conditionally independent distance computations, as we now describe.

Let $\alpha > 1$ and assume $h' > \lfloor \alpha \log_2 \log_2 n \rfloor$. Let $0 \leq h'' < h'$ such that $\Delta h \equiv h' - h'' = \lfloor \alpha \log_2 \log_2 n \rfloor$.
Let $a_0, b_0 \in L^{(h)}_{h'}$. For $x \in \{a, b\}$, denote by $x_1, \ldots, x_{2^{\Delta h}}$ the vertices in $L^{(h)}_{h''}$ that are below $x_0$ and, for
$j = 1, \ldots, 2^{\Delta h}$, let $X_j$ be the leaves of $T^{(h)}$ below $x_j$. See Figure 1. Assume that we are given $\Theta_{a_0,\cdot}$, $\Theta_{b_0,\cdot}$.
For $1 \leq j \leq 2^{\Delta h}$, we estimate $\tau(a_0, b_0)$ as follows

$$\bar\tau(a_j, b_j) = -\ln\left(\frac{1}{|A_j||B_j|} \sum_{a' \in A_j} \sum_{b' \in B_j} \Theta_{a_0,a'}^{-1}\, \Theta_{b_0,b'}^{-1}\, e^{-\hat\tau(a', b')}\right).$$

This choice of estimator is suggested by the following observation

$$e^{-\bar\tau(a_j, b_j)} = \frac{1}{|A_j||B_j|} \sum_{a' \in A_j} \sum_{b' \in B_j} \Theta_{a_0,a'}^{-1}\, \Theta_{b_0,b'}^{-1}\, e^{-\hat\tau(a', b')}$$

$$= \frac{1}{e^{-\tau(a_0, a_j) - \tau(b_0, b_j)}} \cdot \frac{1}{k} \sum_{i=1}^k \left(\frac{1}{|A_j|} \sum_{a' \in A_j} \frac{\sigma^i_{a'}}{\Theta_{a_j,a'}}\right) \left(\frac{1}{|B_j|} \sum_{b' \in B_j} \frac{\sigma^i_{b'}}{\Theta_{b_j,b'}}\right).$$

Note that the first line depends only on estimates $(\hat\tau(u, v))_{u,v \in [n]}$ and $\{\Theta_{v,\cdot}\}_{v \in V_{a_0} \cup V_{b_0}}$. The last line is the
empirical distance between the reconstructed states at $a$ and $b$ when the flow is chosen to be uniform in
Proposition 1.

Figure 1: Accuracy amplification.

Figure 2: Examples of dense balls with six points. The estimate in this case would be one of the central points.

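For $d \in \mathbb{R}$ and $r > 0$, let $B_r(d)$ be the ball of radius $r$ around $d$. We define the dense ball around $j$ to be
the smallest ball around $\bar\tau(a_j, b_j)$ containing at least 2/3 of $J = \{\bar\tau(a_{j'}, b_{j'})\}_{j'=1}^{2^{\Delta h}}$ (as a multiset). The radius
of the dense ball around $j$ is

$$r_j^* = \inf\left\{r : |B_r(\bar\tau(a_j, b_j)) \cap J| \geq \frac{2}{3}\left(2^{\Delta h}\right)\right\},$$

for $j = 1, \ldots, 2^{\Delta h}$. We define our estimate of $\tau(a_0, b_0)$ to be $\bar\tau'(a_0, b_0) = \bar\tau(a_{j^*}, b_{j^*})$, where $j^* = \arg\min_j r_j^*$.
See Figure 2. For $D > 0$, $W > 5$, we define

$$\mathbb{SD}(a_0, b_0) = \mathbb{1}\left\{2^{-\Delta h} \left|\left\{j : \bar\tau(a_j, b_j) \leq D + \ln(\ldots)\right\}\right| > \frac{1}{2}\right\}.$$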
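
The dense-ball rule above is a robust way of selecting one estimate out of many, and can be sketched in a few lines. The function name `dense_ball_estimate` is an assumption, as is the use of closed intervals $[x - r, x + r]$ for the balls; the sketch takes the list of values $\bar\tau(a_j, b_j)$ as given.

```python
import math

def dense_ball_estimate(estimates, fraction=2.0 / 3.0):
    """Return (estimate, radius): the value whose smallest closed interval
    [x - r, x + r] covers at least `fraction` of all values (as a multiset)."""
    if not estimates:
        raise ValueError("need at least one estimate")
    need = max(1, math.ceil(fraction * len(estimates) - 1e-12))
    best_value, best_radius = None, math.inf
    for x in estimates:
        # Radius of the dense ball around x: the need-th smallest distance to x.
        radius = sorted(abs(x - y) for y in estimates)[need - 1]
        if radius < best_radius:
            best_value, best_radius = x, radius
    return best_value, best_radius

if __name__ == "__main__":
    # Six estimates as in Figure 2: four clustered values and two outliers.
    print(dense_ball_estimate([1.0, 1.1, 0.9, 1.05, 5.0, -3.0]))
```

As long as more than 2/3 of the estimates are accurate, the selected value lies close to them, no matter how far off the remaining ones are; this is why a second-moment guarantee on each individual estimate suffices.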



We extend Lemmas 2 and 3 to $\bar\tau'(a_0, b_0)$ and $\mathbb{SD}(a_0, b_0)$.



Proposition 2 (Deep Distance Computation: Small Diameter) Let $\alpha > 1$, $D > 0$, $\gamma > 0$, and $\varepsilon > 0$. Let
$a_0, b_0 \in L^{(h)}_{h'}$ as above. There exists $\kappa > 1$ such that if the following conditions hold:

• [Small Diameter] $\tau(a_0, b_0) < D$,

• [Sequence Length] $k > \log^\kappa(n)$,

then

$$|\bar\tau'(a_0, b_0) - \tau(a_0, b_0)| < \varepsilon,$$

with probability at least $1 - O(n^{-\gamma})$.



Proposition 3 (Deep Distance Computation: Diameter Test) Let a > 1, D > 0, W > 5, arad 7 > 0. Le? 
ao, 60 £ ^ft/'' as a b° ve - There exists k > 1 swc/i f/iaf if the following conditions hold: 

• [Large Diameter] r(ao, 60) > f + In W, 

• [Sequence Length] k > log K (n), 
then 

SO(a ,6o) = 0, 

with probability at least 1 — rT 1 . On the other hand, if the first condition above is replaced by 

• [Small Diameter] t(qq, bo) < D + In 
then

$$\mathbb{SD}(a_0, b_0) = 1,$$

with probability at least $1 - n^{-\gamma}$.
3 Reconstructing Homogeneous Trees



In this section, we prove our main result in the case of homogeneous trees. More precisely, we prove the 
following. 

Theorem 4 (Main Result: Homogeneous Case) Let $0 < \Delta < f < g < +\infty$ and denote by $\mathbb{HY}^{f,g}$ the set of
all homogeneous phylogenies $T = (V, E, [n], \rho; \tau)$ satisfying $f < \tau_e < g$ and $\tau_e$ is an integer multiple of $\Delta$,
$\forall e \in E$. Let $g^* = \ln\sqrt{2}$. Then, for all $\varphi \ge 2$, $0 < \Delta < f < g < g^*$ and $Q \in \mathbb{Q}_\varphi$, there is a distance-based
method solving the phylogenetic reconstruction problem on $\mathbb{HY}^{f,g} \otimes \{Q\}$ with $k = \mathrm{poly}(\log n)$.

All proofs are relegated to Appendix B. 

In the homogeneous case, we can build the tree level by level using simple "four-point" techniques [Bun71]. 
See e.g. [SS03, Fel04] for background and details. See also Section 3.2 below. The underlying combinatorial 
algorithm we use here is essentially identical to the one used by Mossel in [Mos04]. From Propositions 2 and 3, 
we get that the "local metric" on each level is accurate as long as we compute adequate weights. We summarize 
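this fact in the next proposition. For $\Delta > 0$ and $z \in \mathbb{R}_+$, we let $[z]_\Delta$ be the closest multiple of $\Delta$ to $z$ (breaking
ties arbitrarily). We define

$$d(a_0, b_0) = \begin{cases} [\bar\tau'(a_0, b_0)]_\Delta, & \text{if } \mathbb{SD}(a_0, b_0) = 1, \\ +\infty, & \text{o.w.} \end{cases}$$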



Proposition 4 (Deep Distorted Metric) Let $D > 0$, $W > 5$, and $\gamma > 0$. Let $T = (V, E, [n], \rho; \tau) \in \mathbb{HY}^{f,g}$
with $g < g^*$. Let $a_0, b_0 \in L_{h'}^{(h)}$ for $0 \le h' < h$. Assume we are given, for $x = a, b$, $\theta_e$ for all $e \in V_x$. There
exists $\kappa > 0$, such that if the following condition holds:

• [Sequence Length] The sequence length is $k > \log^{\kappa}(n)$,

then we have, with probability at least $1 - O(n^{-\gamma})$,

$$d(a_0, b_0) = \tau(a_0, b_0)$$

under either of the following two conditions:

1. [Small Diameter] $\tau(a_0, b_0) < D$, or

2. [Finite Estimate] $d(a_0, b_0) < +\infty$.

It remains to show how to compute the weights, which is the purpose of the next section.

3.1 Estimating Averaging Weights

Proposition 4 relies on the prior computation of the weights $\theta_e$ for all $e \in V_x$, for $x = a, b$. In this section, we
show how this estimation is performed.

Let $a_0, b_0, c_0 \in L_{h'}^{(h)}$. Denote by $z$ the meeting point of the paths joining $a_0, b_0, c_0$. We define the "three-point"
estimate

$$\hat\theta_{z,a_0} = \mathcal{O}(a_0; b_0, c_0) = \exp\left(-\frac{1}{2}\left[d(a_0, b_0) + d(a_0, c_0) - d(b_0, c_0)\right]\right).$$

Note that the expression in parenthesis is an estimate of the distance between $a_0$ and $z$.
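As a small illustration of the rounding $[z]_\Delta$ and of the three-point estimate, the following sketch assumes the three pairwise distances are given as floats. The names `round_to_multiple` and `three_point_weight` are hypothetical, and ties in the rounding are broken by Python's built-in `round` (the paper allows arbitrary tie-breaking).

```python
import math

def round_to_multiple(z, delta):
    """Closest multiple of delta to z (ties broken as Python's round does)."""
    return delta * round(z / delta)

def three_point_weight(d_ab, d_ac, d_bc):
    """exp(-(d_ab + d_ac - d_bc) / 2), i.e. exp(-distance from a to the meeting point z)."""
    return math.exp(-0.5 * (d_ab + d_ac - d_bc))
```

For instance, if $a_0$, $b_0$, $c_0$ sit at distances 0.2, 0.3 and 0.5 from $z$, the pairwise distances are 0.5, 0.7 and 0.8, and the weight evaluates to $e^{-0.2}$.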
Proposition 5 (Averaging Weight Estimation) Let $a_0, b_0, c_0 \in L_{h'}^{(h)}$ as above. Assume that the assumptions
of Propositions 2, 3, 4 hold. Assume further that the following condition holds:

• [Small Diameter] $\tau(a_0, b_0), \tau(a_0, c_0), \tau(b_0, c_0) < D + \ln W$,

then

$$\hat\theta_{z,a_0} = e^{-\tau(z, a_0)},$$

with probability at least $1 - O(n^{-\gamma})$, where $\hat\theta_{z,a_0} = \mathcal{O}(a_0; b_0, c_0)$.



3.2 Putting it All Together 

Let $0 \le h' < h$ and $\mathcal{Q} = \{a_0, b_0, c_0, d_0\} \subseteq L_{h'}^{(h)}$. The topology of $T$ restricted to $\mathcal{Q}$ is completely
characterized by a bi-partition or quartet split $q$ of the form: $a_0b_0|c_0d_0$, $a_0c_0|b_0d_0$ or $a_0d_0|b_0c_0$. The most basic
operation in quartet-based reconstruction algorithms is the inference of such quartet splits. In distance-based
methods in particular, this is usually done by performing the so-called four-point test: letting

$$\mathcal{F}(a_0b_0|c_0d_0) = \frac{1}{2}\left[\tau(a_0, c_0) + \tau(b_0, d_0) - \tau(a_0, b_0) - \tau(c_0, d_0)\right],$$

we have

$$q = \begin{cases} a_0b_0|c_0d_0 & \text{if } \mathcal{F}(a_0b_0|c_0d_0) > 0, \\ a_0c_0|b_0d_0 & \text{if } \mathcal{F}(a_0b_0|c_0d_0) < 0, \\ a_0d_0|b_0c_0 & \text{o.w.} \end{cases}$$

Of course, we cannot compute $\mathcal{F}(a_0b_0|c_0d_0)$ directly unless $h' = 0$. Instead we use Proposition 4.
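The classical four-point test just described can be sketched as follows, assuming exact additive distances stored in a dictionary keyed by unordered pairs. The function name `four_point_split` and the tolerance `tol` (used only to absorb floating-point error in the zero case) are illustrative assumptions.

```python
def four_point_split(tau, a, b, c, d, tol=1e-12):
    """Infer the quartet split on {a, b, c, d} from additive distances.

    `tau` maps frozenset({x, y}) to the tree distance between x and y.
    Returns the split as a pair of pairs, e.g. ((a, b), (c, d)).
    """
    def dist(x, y):
        return tau[frozenset((x, y))]

    f = 0.5 * (dist(a, c) + dist(b, d) - dist(a, b) - dist(c, d))
    if f > tol:
        return ((a, b), (c, d))   # positive statistic: a and b are siblings
    if f < -tol:
        return ((a, c), (b, d))   # negative statistic: a and c are siblings
    return ((a, d), (b, c))       # zero statistic: the remaining split
```

On a quartet with split $ab|cd$, pendant edges of length 1 and an internal edge of length 0.5, the statistic equals the internal edge length 0.5, so the first branch is taken.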



Deep Four-Point Test. Assume we have previously computed weights $\theta_e$ for all $e \in V_x$, for $x = a, b, c, d$.
We let

$$\bar{\mathcal{F}}(a_0b_0|c_0d_0) = \frac{1}{2}\left[d(a_0, c_0) + d(b_0, d_0) - d(a_0, b_0) - d(c_0, d_0)\right], \qquad (7)$$

and we define the deep four-point test

$$\mathbb{FP}(a_0, b_0|c_0, d_0) = \mathbb{1}\{\bar{\mathcal{F}}(a_0b_0|c_0d_0) > f/2\},$$

with $\mathbb{FP}(a_0, b_0|c_0, d_0) = 0$ if any of the distances in (7) is infinite. Also, we extend the diameter test $\mathbb{SD}$ to
arbitrary subsets by letting $\mathbb{SD}(\mathcal{S}) = 1$ if and only if $\mathbb{SD}(x, y) = 1$ for all pairs $x, y \in \mathcal{S}$.
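A minimal sketch of the deep four-point test follows. It assumes the deep distances are supplied by a caller-provided function `d(x, y)` that may return `math.inf`; the name `deep_four_point` and the calling convention are hypothetical.

```python
import math

def deep_four_point(d, a, b, c, e, f):
    """Return 1 if the deep test accepts the split ab|ce, else 0.

    `d(x, y)` is a (possibly infinite) deep distance estimate and `f` is the
    lower bound on edge lengths. Any infinite distance forces the answer 0.
    """
    terms = (d(a, c), d(b, e), d(a, b), d(c, e))
    if any(math.isinf(t) for t in terms):
        return 0
    statistic = 0.5 * (terms[0] + terms[1] - terms[2] - terms[3])
    return 1 if statistic > f / 2 else 0
```

The threshold $f/2$ separates the correct split, whose statistic is at least the internal path length (hence at least $f$), from the other two, whose statistics are non-positive when the distances are exact.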



Algorithm. Fix $\alpha > 1$, $D > 4g$, $W > 5$, $\gamma > 3$. Choose $k$ so as to satisfy Propositions 4 and 5. Let $Z_0$ be
the set of leaves. The algorithm, a standard cherry picking algorithm, is detailed in Figure 3.
Algorithm

Input: Distance estimates $\{\hat\tau(a, b)\}_{a,b \in [n]}$;
Output: Tree;

• For $h' = 1, \ldots, h - 1$,

1. Four-Point Test. Let

$$\mathcal{R}_{h'} = \{q = ab|cd : \forall a, b, c, d \in Z_{h'} \text{ distinct such that } \mathbb{FP}(q) = 1\}.$$

2. Cherry Picking. Identify the cherries in $\mathcal{R}_{h'}$, that is, those pairs of vertices that only appear on the same
side of the quartet splits in $\mathcal{R}_{h'}$. Let

$$Z_{h'+1} = \{a_1^{(h'+1)}, \ldots, a_{2^{h-(h'+1)}}^{(h'+1)}\}$$

be the parents of the cherries in $Z_{h'}$.

3. Weight Estimation. For all $z \in Z_{h'+1}$,

(a) Let $x, y$ be the children of $z$. Choose $w$ to be any other vertex in $Z_{h'}$ with $\mathbb{SD}(\{x, y, w\}) = 1$.

(b) Compute

$$\hat\theta_{z,x} = \mathcal{O}(x; y, w).$$

(c) Repeat the previous step interchanging the role of $x$ and $y$.



Figure 3: Algorithm.
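The cherry-picking step of Figure 3 can be sketched as below. The sketch assumes the accepted quartet splits are given as pairs of pairs and interprets "only appear on the same side" as: the pair shares a side in at least one accepted split and is never separated by one. The function name `find_cherries` is hypothetical.

```python
def find_cherries(splits):
    """Return the pairs that share a side in some split and are never separated.

    `splits` is an iterable of ((a, b), (c, d)) accepted quartet splits.
    """
    together, apart = set(), set()
    for left, right in splits:
        together.add(frozenset(left))
        together.add(frozenset(right))
        for x in left:
            for y in right:
                apart.add(frozenset((x, y)))  # x and y lie on opposite sides
    return sorted(tuple(sorted(pair)) for pair in together - apart)
```

A pair such as $(a, c)$ may share a side in a split like $ac|ef$ and still fail to be a cherry; it is discarded as soon as another accepted split, such as $ab|cd$, separates it.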
A Extending to General Trees



It is possible to generalize the previous arguments to general trees, using a combinatorial algorithm of [DMR09a], 
thereby giving a proof of Theorem 3. To apply the algorithm of [DMR09a] we need to obtain a generalization 
of Proposition 4 for disjoint subtrees in "general position." This is somewhat straightforward and we give a 
quick sketch in this section. 

A.1 Basic Definitions

The algorithm in [DMR09a] is called Blindfolded Cherry Picking. We refer the reader to [DMR09a] for a full 
description of the algorithm, which is somewhat involved. It is very similar in spirit to the algorithm intro- 
duced in Section 3.2, except for complications due to the non-homogeneity of the tree. The proof in [DMR09a] 
is modular and relies on two main components: a distance-based combinatorial argument which remains
unchanged in our setting; and a statistical argument which we now adapt. The key to the latter is [DMR09a,
Proposition 4]. Note that [DMR09a, Proposition 4] is not distance-based as it relies on a complex ancestral
reconstruction function — recursive majority. Our main contribution in this section is to show how this result 
can be obtained using the techniques of the previous sections — leading to a fully distance-based reconstruction 
algorithm. 

In order to explain the complications due to the non-homogeneity of the tree and state our main result, we 
first need to borrow a few definitions from [DMR09a]. 

Basic Definitions. Fix $0 < \Delta < f < g < g^*$ as in Theorem 3. Let $T = (V, E, [n], \rho; \tau) \in \mathbb{Y}^{f,g}$ be a
phylogeny with underlying tree $T = (V, E)$. In this section, we sometimes refer to the edge set, vertex set and
leaf set of a tree $T'$ as $\mathcal{E}(T')$, $\mathcal{V}(T')$, and $\mathcal{L}(T')$ respectively.

Definition 9 (Restricted Subtree) Let $V' \subseteq V$ be a subset of the vertices of $T$. The subtree of $T$ restricted to
$V'$ is the tree $T'$ obtained by 1) keeping only nodes and edges on paths between vertices in $V'$ and 2) by then
contracting all paths composed of vertices of degree 2, except the nodes in $V'$. We sometimes use the notation
$T' = T|_{V'}$. See Figure 4 for an example.

Definition 10 (Edge Disjointness) Denote by $\mathrm{Path}_T(x, y)$ the path (sequence of edges) connecting $x$ to $y$ in
$T$. We say that two restricted subtrees $T_1$, $T_2$ of $T$ are edge disjoint if

$$\mathrm{Path}_T(x_1, y_1) \cap \mathrm{Path}_T(x_2, y_2) = \emptyset,$$

for all $x_1, y_1 \in \mathcal{L}(T_1)$ and $x_2, y_2 \in \mathcal{L}(T_2)$. We say that $T_1$, $T_2$ are edge sharing if they are not edge disjoint.
See Figure 5 for an example.

Definition 11 (Legal Subforest) We say that a tree is a rooted full binary tree if all its internal nodes have
degree 3 except the root which has degree 2. A restricted subtree $T_1$ of $T$ is a legal subtree of $T$ if it is also a
rooted full binary tree. We say that a forest

$$\mathcal{F} = \{T_1, T_2, \ldots\},$$

is a legal subforest of $T$ if the $T_\iota$'s are edge-disjoint legal subtrees of $T$. We denote by $\rho(\mathcal{F})$ the set of roots of $\mathcal{F}$.

Definition 12 (Dangling Subtrees) We say that two edge-disjoint legal subtrees $T_1$, $T_2$ of $T$ are dangling if
there is a choice of root for $T$ not in $T_1$ or $T_2$ that is consistent with the rooting of both $T_1$ and $T_2$. See Figure 6
below for an example where two legal, edge-disjoint subtrees are not dangling.






Figure 6: Basic Disjoint Setup (General). The rooted subtrees $T_1$, $T_2$ are edge-disjoint but are not assumed to
be dangling. The white nodes may not be in the restricted subtrees $T_1$, $T_2$. The case $w_1 = x_1$ and/or $w_2 = x_2$
is possible. Note that if we root the tree at any node along the dashed path, the subtrees rooted at $y_1$ and $y_2$ are
edge-disjoint and dangling (unlike $T_1$ and $T_2$).



Definition 13 (Basic Disjoint Setup (General)) Let $T_1 = T_{x_1}$ and $T_2 = T_{x_2}$ be two restricted subtrees of $T$
rooted at $x_1$ and $x_2$ respectively. Assume further that $T_1$ and $T_2$ are edge-disjoint, but not necessarily dangling.
Denote by $y_\iota, z_\iota$ the children of $x_\iota$ in $T_\iota$, $\iota = 1, 2$. Let $w_\iota$ be the node in $T$ where the path between $T_1$ and $T_2$
meets $T_\iota$, $\iota = 1, 2$. Note that $w_\iota$ may not be in $T_\iota$ since $T_\iota$ is restricted, $\iota = 1, 2$. If $w_\iota \ne x_\iota$, assume without loss
of generality that $w_\iota$ is in the subtree of $T$ rooted at $z_\iota$, $\iota = 1, 2$. We call this configuration the Basic Disjoint
Setup (General). See Figure 6. Let $\tau(T_1, T_2)$ be the length of the path between $w_1$ and $w_2$ in the metric $\tau$.



A.2 Deep Distorted Metric 

Our reconstruction algorithm for homogeneous trees (see Section 3) builds the tree level by level and only 
encounters situations where one has to compute the distance between two dangling subtrees (that is, the path 
connecting the subtrees "goes above them"). However, when reconstructing general trees by growing a subfor- 
est from the leaves, more general situations such as the one depicted in Figure 6 cannot be avoided and have to 
be dealt with carefully. 

Hence, our goal in this subsection is to compute the distance between the internal nodes $x_1$ and $x_2$ in the
Basic Disjoint Setup (General). We have already shown how to perform this computation when $T_1$ and $T_2$ are
dangling, as this case is handled easily by Proposition 4 (after a slight modification of the distance estimate;
see below). However, in the general case depicted in Figure 6, there is a complication. When $T_1$ and $T_2$ are not
dangling, the reconstructed sequences at $x_1$ and $x_2$ are not conditionally independent. But it can be shown that
for the algorithm Blindfolded Cherry Picking to work properly, we need: 1) to compute the distance between $x_1$
and $x_2$ correctly when the two subtrees are close and dangling; 2) to detect when the two subtrees are far apart (but
an accurate distance estimate is not required in that case). This turns out to be enough because the algorithm 
Blindfolded Cherry Picking ensures roughly that close reconstructed subtrees are always dangling. We refer 
the reader to [DMR09a] for details. 

The key point is the following: if one computes the distance between $y_1$ and $y_2$ rather than the distance
between $x_1$ and $x_2$, then the dangling assumption is satisfied (re-root the tree at any node along the path
connecting $w_1$ and $w_2$). However, when the algorithm has only reconstructed $T_1$ and $T_2$, we cannot tell which
pair in $\{y_1, z_1\} \times \{y_2, z_2\}$ is the right one to use for the distance estimation. Instead, we compute the distance
for all pairs in $\{y_1, z_1\} \times \{y_2, z_2\}$ and the following then holds: in the dangling case, all these distances will
agree (after subtracting the length of the edges between $x_1$, $x_2$ and $\{y_1, z_1, y_2, z_2\}$); in the general case, at least
one is correct. This is the basic observation behind the routine DistortedMetric in Figure 7 and the proof
of Proposition 6 below. We slightly modify the definitions of Section 3.

Using the notation of Definition 13, fix $(a_0, b_0) \in \{y_1, z_1\} \times \{y_2, z_2\}$. For $v \in V$ and $\ell \in L$, denote
by $|\ell|_v$ the graph distance (that is, the number of edges) between $v$ and leaf $\ell$. Assume that we are given
$\theta_e$ for all $e \in \mathcal{E}(T_{a_0}) \cup \mathcal{E}(T_{b_0})$. Let $\alpha > 1$ and $\Delta h = \lfloor \alpha \log_2 \log_2 n \rfloor$. Imagine (minimally) completing the
subtrees below $a_0$ and $b_0$ with 0-length edges so that the leaves below $a_0$ and $b_0$ are at distance at least $\Delta h$. For
$x \in \{a, b\}$, denote by $x_1, \ldots, x_{2^{\Delta h}}$ the vertices below $x_0$ at distance $\Delta h$ from $x_0$ and, for $j = 1, \ldots, 2^{\Delta h}$, let
$X_j$ be the leaves of $T$ below $x_j$. For $1 \le j \le 2^{\Delta h}$, we estimate $\tau(a_0, b_0)$ as follows

Note that, because the tree is binary, it holds that

$$\sum_{a' \in A_j} \sum_{b' \in B_j} 2^{-|a'|_{a_j} - |b'|_{b_j}} = \sum_{a' \in A_j} 2^{-|a'|_{a_j}} \sum_{b' \in B_j} 2^{-|b'|_{b_j}} = 1,$$

and we can think of the weights on $A_j$ (similarly for $B_j$) as resulting from a homogeneous flow $\Psi_{a_j}$ from $a_j$ to
$A_j$. Then, the bound on the variance of



s aj = £ 2->'i^e-i a , 



in Proposition 1 still holds with 



Moreover . is uniformly bounded following an argument identical to (8) in the proof of Lemma 4. For 

$d \in \mathbb{R}$ and $r > 0$, let $B_r(d)$ be the ball of radius $r$ around $d$. We define the dense ball around $j$ to be the smallest
ball around $\bar\tau(a_j, b_j)$ containing at least 2/3 of $J = \{\bar\tau(a_{j'}, b_{j'})\}_{j'=1}^{2^{\Delta h}}$ (as a multiset). The radius of the dense
ball around $j$ is

$$r_j^* = \inf\left\{ r : |B_r(\bar\tau(a_j, b_j)) \cap J| \ge \frac{2}{3}\,(2^{\Delta h}) \right\},$$

for $j = 1, \ldots, 2^{\Delta h}$. We define our estimate of $\tau(a_0, b_0)$ to be $\bar\tau'(a_0, b_0) = \bar\tau(a_{j^*}, b_{j^*})$, where $j^* = \arg\min_j r_j^*$.
See Figure 2. For $D > 0$, $W > 5$, we define



SD(a ,6 ) 

and we let 



1 

> 2 



$$d(a_0, b_0) = \begin{cases} [\bar\tau'(a_0, b_0)]_\Delta, & \text{if } \mathbb{SD}(a_0, b_0) = 1, \\ +\infty, & \text{o.w.} \end{cases}$$



Proposition 6 (Accuracy of DistortedMetric) Let $\alpha > 1$, $D > 0$, $W > 5$, $\gamma > 0$ and $g < g' < g^*$.
Consider the Basic Disjoint Setup (General) with $\mathcal{F} = \{T_1, T_2\}$ and $\mathcal{Q} = \{y_1, z_1, y_2, z_2\}$. Assume we are given
$\theta_e$ for all $e \in \mathcal{E}(T_1) \cup \mathcal{E}(T_2)$. Let $\Upsilon$ denote the output of DistortedMetric in Figure 7. There exists $\kappa > 0$,
such that if the following conditions hold:
Algorithm DistortedMetric

Input: Rooted forest $\mathcal{F} = \{T_1, T_2\}$ rooted at vertices $x_1, x_2$; weights $\tau_e$, for all $e \in \mathcal{E}(T_1) \cup \mathcal{E}(T_2)$;
Output: Distance $\Upsilon$;

• [Children] Let $y_\iota, z_\iota$ be the children of $x_\iota$ in $\mathcal{F}$ for $\iota = 1, 2$ (if $x_\iota$ is a leaf, set $z_\iota = y_\iota = x_\iota$);

• [Distance Computations] For all pairs $(a, b) \in \{y_1, z_1\} \times \{y_2, z_2\}$, compute

$$\mathcal{D}(a, b) := d(a, b) - \tau(a, x_1) - \tau(b, x_2);$$

• [Multiple Test] If

$$\max\left\{ \mathcal{D}(r_1^{(1)}, r_2^{(1)}) - \mathcal{D}(r_1^{(2)}, r_2^{(2)}) : (r_1^{(\iota)}, r_2^{(\iota)}) \in \{y_1, z_1\} \times \{y_2, z_2\},\ \iota = 1, 2 \right\} = 0,$$

return $\Upsilon := \mathcal{D}(z_1, z_2)$, otherwise return $\Upsilon := +\infty$ (return $\Upsilon := +\infty$ if any of the distances above is $+\infty$).



Figure 7: Routine DistortedMetric.

• [Edge Length] It holds that $\tau(e) < g'$, $\forall e \in \mathcal{E}(T_x)$, $x \in \mathcal{Q}$ (see footnote 5);

• [Sequence Length] The sequence length is $k > \log^{\kappa}(n)$,

then we have, with probability at least $1 - O(n^{-\gamma})$,

$$\Upsilon = \tau(x_1, x_2)$$

under either of the following two conditions:

1. [Dangling Case] $T_1$ and $T_2$ are dangling and $\tau(T_1, T_2) < D$, or

2. [Finite Estimate] $\Upsilon < +\infty$.

The proof, which is a simple combination of the proof of Proposition 4 and the remarks above the statement of 
Proposition 6, is left out. 
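The routine DistortedMetric of Figure 7 can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the deep distance is passed in as a function `d(a, b)`, the edge lengths from each child to its root are passed as dictionaries, and the parameter `tol` (default 0, matching the exact-equality test that is meaningful for discretized distances) is an added convenience for floating-point inputs.

```python
import math
from itertools import product

def distorted_metric(children1, children2, d, edge_len1, edge_len2, tol=0.0):
    """Sketch of DistortedMetric for roots x1, x2 with the given children.

    children1 = (y1, z1), children2 = (y2, z2); d(a, b) may return math.inf;
    edge_len1[a] is the length of the edge between a and x1 (same for x2).
    Returns the common corrected distance, or math.inf if the test fails.
    """
    corrected = []
    for a, b in product(children1, children2):
        dist = d(a, b)
        if math.isinf(dist):
            return math.inf            # any infinite distance: give up
        corrected.append(dist - edge_len1[a] - edge_len2[b])
    # Multiple test: all four corrected distances must agree.
    if max(corrected) - min(corrected) <= tol:
        return corrected[-1]           # value for the pair (z1, z2)
    return math.inf
```

In the dangling case every child-to-child path passes through both roots, so the four corrected values coincide with $\tau(x_1, x_2)$; a disagreement signals a non-dangling configuration and yields $+\infty$.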

Full Algorithm. The rest of the Blindfolded Cherry Picking algorithm is unchanged except for an additional 
step to compute averaging weights as in the algorithm of Section 3. This concludes our sketch of the proof of 
Theorem 3. 



5 For technical reasons explained in [DMR09a], we allow edges slightly longer than the upper bound g. 
B Proofs

Proof of Lemma 1: Note that E[F t f] = m {e~ T ^ Q ) Then 



E 



v T F ab v 



m e 



-r{a,b)Q 



iS<E> 



Proof of Lemma 2: By Azuma's inequality we get 

P [t(u,v) > t(u,v) + e] 



i=i 



< 



i=i 

* < E^l - (1 - e -) e -^ogiog(n) 



i=l 



/ ((1 - e ^) e - 51 °s lo s( n )) 2 
- 6XP ^ 2fc(2^) 2 
< n~T, 

for $\kappa$ large enough depending on $\delta, \varepsilon, \gamma$. Above, we used that

$$\tau(u, v) = -\ln \mathbb{E}[\sigma_u \sigma_v].$$

A similar inequality holds for the other direction. ■
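The identity $\tau(u, v) = -\ln \mathbb{E}[\sigma_u \sigma_v]$ suggests the plug-in estimator obtained by replacing the expectation with an empirical average over the $k$ sites. The sketch below assumes the per-site values $\sigma_u^i$, $\sigma_v^i$ are given as equal-length lists of floats; the name `empirical_distance` and the convention of returning infinity for a non-positive average are assumptions made for the illustration.

```python
import math

def empirical_distance(sigma_u, sigma_v):
    """Plug-in estimate -ln((1/k) * sum_i sigma_u[i] * sigma_v[i])."""
    if len(sigma_u) != len(sigma_v) or not sigma_u:
        raise ValueError("need two non-empty sequences of equal length")
    mean = sum(x * y for x, y in zip(sigma_u, sigma_v)) / len(sigma_u)
    if mean <= 0:
        return math.inf  # the logarithm is undefined; treat as "too far apart"
    return -math.log(mean)
```

For example, four sites with products $1, 1, 1, -1$ give an average of $1/2$ and an estimated distance of $\ln 2$.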
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Proof of Lemma 3: Assume the first three conditions hold. By Azuma's inequality we get 

W~ 

t(u, v) - t(uo, u) - t(vq, v) < D + In — 
k 

-Vo-X > 2W- l e- D e- T( - uo ' u) - r{vo ' v 
k i=i 

- y"<4 > e"^^ + W - 1 e- D e- T{uo ' u) - T( - vo ' v) 

i=i 



< 



k 

< exp 

< n~ 7 , 



i=i 



lg-£ , e --r(Mo,M)-r(i;o,f 



2k(2u/k) 2 



for $\kappa$ large enough. A similar argument gives the second claim. ■

Proof of Proposition 1: We follow the proofs of [EKPS00, MP03]. Let $e_i$ be the unit vector in direction $i$. Let
$x \in [n]$, then

e[£ I ^] = ^<**>« 

Therefore, 

| a p ] = eJ p e Ti - p ^ Q v = a p e~ T ^ x \ 

and 



E[S\a p }= 



xe[n] 



e 



p,x 



Up ^ *(X) = Op 

x£[ri\ 



In particular, 



E[s] = J2nm = o. 



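For $x, y \in [n]$, let $x \wedge y$ be the meeting point of the paths between $\rho, x, y$. We have

$$\begin{aligned}
\mathbb{E}[\sigma_x \sigma_y] &= \sum_{i \in \Phi} \mathbb{P}[\xi_{x \wedge y} = i]\, \mathbb{E}[\sigma_x \sigma_y \mid \xi_{x \wedge y} = i] \\
&= \sum_{i \in \Phi} \pi_i\, \mathbb{E}[\sigma_x \mid \xi_{x \wedge y} = i]\, \mathbb{E}[\sigma_y \mid \xi_{x \wedge y} = i] \\
&= \sum_{i \in \Phi} \pi_i\, \nu_i e^{-\tau(x \wedge y, x)}\, \nu_i e^{-\tau(x \wedge y, y)} \\
&= e^{-\tau(x, y)}.
\end{aligned}$$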
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Then 

\ar[S] = E[S 2 } 

x,y€[n] 
x,ye[n] 

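For $e \in E$, let $e = (e_\uparrow, e_\downarrow)$ where $e_\uparrow$ is the vertex closest to $\rho$. Then, by a telescoping sum, for $u \in V$

$$\sum_{e \in \mathrm{Path}(\rho, u)} R_\rho(e) = \sum_{e \in \mathrm{Path}(\rho, u)} e^{2\tau(\rho, e_\downarrow)} - \sum_{e \in \mathrm{Path}(\rho, u)} e^{2\tau(\rho, e_\uparrow)} = e^{2\tau(\rho, u)} - 1,$$

and therefore

$$\begin{aligned}
\mathbb{E}[S^2] &= \sum_{x, y \in [n]} \Psi(x)\Psi(y)\, e^{2\tau(\rho, x \wedge y)} \\
&= \sum_{x, y \in [n]} \Psi(x)\Psi(y) \Big(1 + \sum_{e \in \mathrm{Path}(\rho, x \wedge y)} R_\rho(e)\Big) \\
&= 1 + \sum_{e \in E} R_\rho(e) \sum_{x, y \in [n]} \mathbb{1}\{e \in \mathrm{Path}(\rho, x \wedge y)\}\, \Psi(x)\Psi(y) \\
&= 1 + \sum_{e \in E} R_\rho(e)\, \Psi(e)^2.
\end{aligned}$$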



Proof of Lemma 4: From (6), we have 

k p ,* < ]T(i- e - 2 ^ 



e 2(h-i)g 



2 2(h-i) 
i=0 

ft 

< ^ e 2jg e -(21n V / 2)j 

J'=l 
ft 

J'=l 
+oo 



where recall that $g^* = \ln\sqrt{2}$.
Proof of Proposition 2: Let 



Z = {a j } 2 t h 1 U{b j } 2 t h 1 ., 
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and 

£ = {(4)ii>- 

Forj = l,...,2 A/ \ let 



/L^ a 3 °3 

1=1 

Note that 

by Lemma 1. For i = 1, . . . , k, j = 1, . . . , 2 A?i , and x G {a, 6}, let 

By the Markov property, it follows that, conditioned on £, 

{<■ : i = l,...,fc,j = l,...,2 A \ xe{a,b}}, 
are mutually independent. Moreover, by Proposition 1, we have 

m J \s]=4., 

and 



E[(alf\£]<2n- 1 - 



1 _ e -2(g*-g) ■ 
Therefore, for any $\zeta > 0$ there exists $\kappa > 1$ such that



3 r(ao,a,)+r( b o,^) f ^ E[a* J £]E[<7* . | £] 



e 

\ i=i 

-(f(a j ,fe J )-r(a 3 ,ao)-T(6j,6o)) 



and 



Var[e" f | 5] = e 2r(ao,a J -)+2T(6o,6,-) f J_ ^ Vai^.aj . | £} 



v A; 2 

< e 2 ^<^)+M^) ( 1 J>[(^.) 2 | £]E[(^) 2 | f ] 



k 2 
i=i 



e 2r(ao,a : ,-)+2T(6o,fei) 
< 



< 



jfe ^1-6-2(9* - 9 ) 

e 49Lalog 2 log 2 nj / o > 2 



fc Vl-e- 2 (f*-f) 

< C- 
Take $\kappa$ large enough such that Lemma 2 holds for diameter $2g\lfloor \alpha \log_2 \log_2 n \rfloor + D$, precision $\varepsilon/6$ and failure
probability $O(n^{-(\gamma+1)})$. Then, we have

$$|\hat\tau(a_j, b_j) - \tau(a_j, b_j)| < \varepsilon/6, \quad \forall j = 1, \ldots, 2^{\Delta h}, \qquad (9)$$

with probability $1 - O(n^{-\gamma})$. Let $\mathcal{E}'$ be the event that (9) holds. Note that

Let 

$$\varepsilon' = \min\left\{ (e^{\varepsilon/3} - e^{\varepsilon/6})\, e^{-D},\ (e^{-\varepsilon/6} - e^{-\varepsilon/3})\, e^{-D} \right\}.$$

By Chebyshev's inequality, we have that

1 



(9) 



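$$\begin{aligned}
&\mathbb{P}\left[\bar\tau(a_j, b_j) < \tau(a_0, b_0) - \tfrac{1}{3}\varepsilon \;\middle|\; \mathcal{E} \cap \mathcal{E}'\right] \\
&\quad \le \mathbb{P}\left[e^{-\bar\tau(a_j, b_j)} > e^{-\tau(a_0, b_0) + \varepsilon/3} \;\middle|\; \mathcal{E} \cap \mathcal{E}'\right] \\
&\quad \le \mathbb{P}\left[e^{-\bar\tau(a_j, b_j)} - \mathbb{E}[e^{-\bar\tau(a_j, b_j)} \mid \mathcal{E} \cap \mathcal{E}'] > e^{-\tau(a_0, b_0) + \varepsilon/3} - e^{-(\hat\tau(a_j, b_j) - \tau(a_j, a_0) - \tau(b_j, b_0))} \;\middle|\; \mathcal{E} \cap \mathcal{E}'\right] \\
&\quad \le \mathbb{P}\left[e^{-\bar\tau(a_j, b_j)} - \mathbb{E}[e^{-\bar\tau(a_j, b_j)} \mid \mathcal{E} \cap \mathcal{E}'] > e^{-\tau(a_0, b_0)}\left[e^{\varepsilon/3} - e^{-(\hat\tau(a_j, b_j) - \tau(a_j, b_j))}\right] \;\middle|\; \mathcal{E} \cap \mathcal{E}'\right] \\
&\quad \le \mathbb{P}\left[e^{-\bar\tau(a_j, b_j)} - \mathbb{E}[e^{-\bar\tau(a_j, b_j)} \mid \mathcal{E} \cap \mathcal{E}'] > e^{-D}\left[e^{\varepsilon/3} - e^{\varepsilon/6}\right] \;\middle|\; \mathcal{E} \cap \mathcal{E}'\right] \\
&\quad \le \mathbb{P}\left[e^{-\bar\tau(a_j, b_j)} - \mathbb{E}[e^{-\bar\tau(a_j, b_j)} \mid \mathcal{E} \cap \mathcal{E}'] > \varepsilon' \;\middle|\; \mathcal{E} \cap \mathcal{E}'\right] \\
&\quad \le 1/12,
\end{aligned}$$

for $\zeta$ small enough. A similar argument holds for

$$\mathbb{P}\left[\bar\tau(a_j, b_j) > \tau(a_0, b_0) + \tfrac{1}{3}\varepsilon \;\middle|\; \mathcal{E} \cap \mathcal{E}'\right] \le 1/12.$$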



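Conditioned on $\mathcal{E} \cap \mathcal{E}'$,

$$\mathcal{L} = 2^{-\Delta h} \sum_{j=1}^{2^{\Delta h}} \mathbb{1}\left\{ |\bar\tau(a_j, b_j) - \tau(a_0, b_0)| < \frac{\varepsilon}{3} \right\}$$

is a sum of independent $\{0, 1\}$-variables with average at least 5/6. By Azuma's inequality, we have

$$\begin{aligned}
\mathbb{P}[\mathcal{L} < 2/3 \mid \mathcal{E} \cap \mathcal{E}'] &\le \mathbb{P}\left[\mathcal{L} - \mathbb{E}[\mathcal{L} \mid \mathcal{E} \cap \mathcal{E}'] < -1/6 \;\middle|\; \mathcal{E} \cap \mathcal{E}'\right] \\
&\le \exp\left(-\frac{(1/6)^2}{2\,(2^{-\Delta h})^2\, 2^{\Delta h}}\right) \\
&\le \exp\left(-O(\log^{\alpha} n)\right) \\
&\le O(n^{-\gamma}),
\end{aligned}$$

where we used that $\alpha > 1$.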
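To make the role of $\alpha > 1$ explicit, the exponent in the Azuma bound can be worked out directly. The following is a sketch of the arithmetic, with logarithms in base 2 as in the definition $\Delta h = \lfloor \alpha \log_2 \log_2 n \rfloor$ and with the constants 72 and 144 obtained here rather than quoted from the paper.

```latex
\begin{align*}
\frac{(1/6)^2}{2\,(2^{-\Delta h})^2\, 2^{\Delta h}}
  &= \frac{2^{\Delta h}}{72}, \\
2^{\Delta h} = 2^{\lfloor \alpha \log_2 \log_2 n \rfloor}
  &\ge \tfrac{1}{2}\,(\log_2 n)^{\alpha}, \\
\exp\!\left(-\frac{2^{\Delta h}}{72}\right)
  &\le \exp\!\left(-\frac{(\log_2 n)^{\alpha}}{144}\right).
\end{align*}
```

Since $\alpha > 1$, the quantity $(\log_2 n)^{\alpha}/144$ eventually exceeds $\gamma \ln n$ for any fixed $\gamma$, which gives the $O(n^{-\gamma})$ bound.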

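This implies that with (unconditional) probability at least $1 - O(n^{-\gamma})$

$$\left|\left\{ j : |\bar\tau(a_j, b_j) - \tau(a_0, b_0)| < \frac{\varepsilon}{3} \right\}\right| > \frac{2}{3}\,(2^{\Delta h}).$$

In particular, there exists $j$ such that

$$\left|\left\{ j' : |\bar\tau(a_{j'}, b_{j'}) - \bar\tau(a_j, b_j)| < \frac{2\varepsilon}{3} \right\}\right| > \frac{2}{3}\,(2^{\Delta h}),$$

and for all $j$ such that

$$|\bar\tau(a_j, b_j) - \tau(a_0, b_0)| > \varepsilon,$$

we have that

$$\left|\left\{ j' : |\bar\tau(a_{j'}, b_{j'}) - \bar\tau(a_j, b_j)| < \frac{2\varepsilon}{3} \right\}\right| \le \frac{1}{3}\,(2^{\Delta h}).$$

Finally, we get that

$$|\bar\tau'(a_0, b_0) - \tau(a_0, b_0)| \le \varepsilon.$$

■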



Proof of Proposition 3: The proof is similar to the proof of Lemma 3 and Proposition 2. ■ 

Proof of Proposition 4: We let $\varepsilon < \Delta/2$, so that rounding an $\varepsilon$-accurate estimate to the closest multiple of $\Delta$
recovers the true distance exactly.

The first part of the proposition follows immediately from Proposition 2 and the second part of Proposition 3.
Note that Propositions 2 and 3 are still valid when $h' < \lfloor \alpha \log_2 \log_2 n \rfloor$ if we take instead $\Delta h =
\min\{h', \lfloor \alpha \log_2 \log_2 n \rfloor\}$. Indeed, in that case there is no need to perform an implicit ancestral sequence
reconstruction and the proofs of the lemmas follow immediately from Lemmas 2 and 3.

For the second part, choose $\kappa$ so as to satisfy the conditions of Proposition 2 with diameter $D + \ln W$ and
apply the first part of Proposition 3. ■



Proof of Proposition 5: The proof follows immediately from Proposition 4 and the remark above the statement 
of Proposition 5. ■ 

Proof of Theorem 4: The proof of Theorem 4 follows from Propositions 4 and 5. Indeed, at each level h', we 
are guaranteed by the above to compute a distorted metric with a radius large enough to detect all cherries on 
the next level using four-point tests. The proof follows by induction. ■ 
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